The concept of ‘dummy variables’ in economics/regression is a useful starting point for thinking about text:
| Topic | Dummy |
|---|---|
| News | 0 |
| Culture | 1 |
| Politics | 2 |
| Entertainment | 3 |
| Document | UK | Top | Pop | Coronavirus |
|---|---|---|---|---|
| News item | 1 | 1 | 0 | 1 |
| Culture item | 0 | 1 | 1 | 0 |
| Politics item | 1 | 0 | 0 | 1 |
| Entertainment item | 1 | 1 | 1 | 1 |
Just like the one-hot (binarised) approach on the preceding slide, but now we count occurrences:
| Document | UK | Top | Pop | Coronavirus |
|---|---|---|---|---|
| News item | 4 | 2 | 0 | 6 |
| Culture item | 0 | 4 | 7 | 0 |
| Politics item | 3 | 0 | 0 | 3 |
| Entertainment item | 3 | 4 | 8 | 1 |
Enter, stage left, scikit-learn:
```python
from sklearn.feature_extraction.text import CountVectorizer

# texts, texts1, texts2 are lists of document strings
vectorizer = CountVectorizer()

# One-shot: learn the vocabulary and transform in a single step
vectors = vectorizer.fit_transform(texts)

# Reusable: fit the vocabulary once, then transform any number of batches
vectorizer.fit(texts)
vectors1 = vectorizer.transform(texts1)
vectors2 = vectorizer.transform(texts2)

print(f'Vocabulary: {vectorizer.vocabulary_}')
print(f'All vectors: {vectors.toarray()}')
```

TF/IDF builds on count vectorisation by normalising each term's frequency by its document frequency across the corpus, so common words receive a large penalty:
\[ W_{t,d} = TF_{t,d} \times \log(N / DF_{t}) \]
For example:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# texts is a list of document strings
vectorizer = TfidfVectorizer()

# Non-reusable form:
vectors = vectorizer.fit_transform(texts)

# Reusable form:
vectorizer.fit(texts)
vectors = vectorizer.transform(texts)

print(f'Vocabulary: {vectorizer.vocabulary_}')
print(f'Full vector: {vectors.toarray()}')
```

Three input texts with a distance weighting (\(d/2\), where \(d<3\)):
| | fluffy | mat | ginger | sat | on | cat | the |
|---|---|---|---|---|---|---|---|
| fluffy | | | 1 | 1 | 0.5 | 0.5 | 2.0 |
| mat | | | | | | 0.5 | 1.5 |
| ginger | | | | 0.5 | 0.5 | 1.0 | 1.5 |
| sat | | | | | 3.0 | 3.0 | 2.5 |
| on | | | | | | 1.5 | 3.0 |
| cat | | | | | | | 2.0 |
| the | | | | | | | |
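The mechanics of a distance-weighted co-occurrence matrix can be sketched as below. The three example texts, the window of "closer than 3 tokens", and the \(d/2\) scoring are all assumptions for illustration, since the slide's actual inputs aren't given:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical texts; the slide's actual three inputs are not given
texts = [
    "the fluffy cat sat on the mat",
    "the ginger cat sat on the mat",
    "the fluffy ginger cat",
]

# Score each co-occurring pair by d/2, where d is the token distance,
# counting only pairs fewer than 3 tokens apart (an assumed reading)
tcm = defaultdict(float)
for text in texts:
    tokens = text.split()
    for i, j in combinations(range(len(tokens)), 2):
        d = j - i
        if d < 3:
            pair = tuple(sorted((tokens[i], tokens[j])))
            tcm[pair] += d / 2

print(tcm[('cat', 'sat')])
print(tcm[('fluffy', 'ginger')])
```

Because co-occurrence is symmetric, sorting each pair means `('cat', 'sat')` and `('sat', 'cat')` accumulate into the same cell, which is why the table above only needs its upper triangle.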
The problem:
Cleaning is necessary, but it is not sufficient to create a tractable term co-occurrence matrix (TCM) on a large corpus.
Typically, some kind of 2- or 3-layer neural network ‘learns’ how to embed the TCM into a lower-dimensional representation: from \(m \times m\) to \(m \times n\), where \(n \ll m\).
Similar to PCA in terms of what we’re trying to achieve, but the process is utterly different.
Requires us to deal in great detail with bi- and tri-grams because negation and sarcasm are hard. Also tends to require training/labelled data.
| Cluster | Geography | Earth Science | History | Computer Science | Total |
|---|---|---|---|---|---|
| 1 | 126 | 310 | 104 | 11,018 | 11,558 |
| 2 | 252 | 10,673 | 528 | 126 | 11,579 |
| 3 | 803 | 485 | 6,730 | 135 | 8,153 |
| 4 | 100 | 109 | 6,389 | 28 | 6,626 |
| Total | 1,281 | 11,577 | 13,751 | 11,307 | 37,916 |
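A cross-tabulation like the one above typically comes from running an unsupervised clusterer over vectorised documents and comparing the clusters to known categories. A minimal sketch using k-means on TF/IDF vectors, with a hypothetical four-document corpus (the slide's table is from a much larger one):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: two Earth Science and two Computer Science items
texts = [
    "plate tectonics and volcanic activity",
    "volcanic eruptions and earthquakes",
    "compilers interpreters and programming languages",
    "programming languages and operating systems",
]

# Vectorise, then partition the documents into k=2 clusters
vectors = TfidfVectorizer().fit_transform(texts)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(vectors)
print(kmeans.labels_)
```

The cluster labels themselves are arbitrary integers; it is the cross-tabulation against known categories (as in the table above) that tells you whether the clusters are meaningful.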
Learning associations of words (or images or many other things) to hidden ‘topics’ that generate them:
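One classic generative approach of this kind is latent Dirichlet allocation (LDA). A minimal sketch with a hypothetical four-document corpus and two assumed topics:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: two latent 'topics' (pets, finance) generate the words
texts = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "bonds markets stocks",
]

# LDA works on raw counts, not TF/IDF weights
counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)  # one topic mixture per document
```

Each row of `doc_topics` is a probability distribution over the two topics, so the model describes every document as a mixture of hidden topics rather than assigning it a single label.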
Basically any of the lessons on The Programming Historian.
:::: {.columns}
::: {.column width="50%"}
- Introduction to Word Embeddings
- The Current Best of Universal Word Embeddings and Sentence Embeddings
- Using GloVe Embeddings
- Working with Facebook’s FastText Library
- Word2Vec and FastText Word Embedding with Gensim
- Sentence Embeddings. Fast, please!
:::
::: {.column width="50%"}
- PlasticityAI Embedding Models
- Clustering text documents using k-means
- Topic extraction with Non-negative Matrix Factorization and LDA
- Topic Modeling with LSA, pLSA, LDA, NMF, BERTopic, Top2Vec: a Comparison
:::
::::
Analysing Text • Jon Reades